Local nearest neighbour classification with applications to semi-supervised learning
Authors
Abstract
We derive a new asymptotic expansion for the global excess risk of a local k-nearest neighbour classifier, where the choice of k may depend upon the test point. This expansion elucidates conditions under which the dominant contribution to the excess risk comes from the locus of points at which each class label is equally likely to occur, but we also show that if these conditions are not satisfied, the dominant contribution may arise from the tails of the marginal distribution of the features. Moreover, we prove that, provided the d-dimensional marginal distribution of the features has a finite ρth moment for some ρ > 4 (as well as other regularity conditions), a local choice of k can yield a rate of convergence of the excess risk of O(n^{−4/(d+4)}), where n is the sample size, whereas for the standard k-nearest neighbour classifier, our theory would require d ≥ 5 and ρ > 4d/(d − 4) finite moments to achieve this rate. Our results motivate a new k-nearest neighbour classifier for semi-supervised learning problems, where the unlabelled data are used to obtain an estimate of the marginal feature density, and fewer neighbours are used for classification when this density estimate is small. The potential improvements over the standard k-nearest neighbour classifier are illustrated both through our theory and via a simulation study.
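To make the proposed classifier concrete, here is a minimal sketch in Python of a density-adaptive k-nearest neighbour rule of the kind described above: the unlabelled data feed a kernel density estimate, and fewer neighbours are used where that estimate is small. The specific scaling k(x) ∝ (n f̂(x))^{4/(d+4)}, the bandwidth choice, and the scikit-learn helpers are illustrative assumptions, not the authors' implementation or tuning.

```python
# A minimal sketch (not the paper's implementation) of a semi-supervised
# k-nearest neighbour classifier in which the number of neighbours k(x)
# shrinks where an unlabelled-data density estimate is small.
# Assumes integer class labels 0, 1, ..., K-1.
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors


def fit_local_knn(X_labelled, y_labelled, X_unlabelled, bandwidth=1.0):
    """Return a prediction function using a local, density-dependent k."""
    n, d = X_labelled.shape
    # Density estimate from the (typically plentiful) unlabelled features.
    kde = KernelDensity(bandwidth=bandwidth).fit(X_unlabelled)
    nn = NearestNeighbors().fit(X_labelled)

    def predict(X_test):
        f_hat = np.exp(kde.score_samples(X_test))  # estimated density at x
        # Local choice of k: fewer neighbours where f_hat is small, capped
        # between 1 and n.  The exponent 4/(d+4) mirrors the rate quoted in
        # the abstract and is an assumption made here for illustration.
        k_local = np.clip((n * f_hat) ** (4.0 / (d + 4)), 1, n).astype(int)
        preds = np.empty(len(X_test), dtype=y_labelled.dtype)
        for i, (x, k) in enumerate(zip(X_test, k_local)):
            _, idx = nn.kneighbors(x.reshape(1, -1), n_neighbors=int(k))
            # Majority vote among the k(x) nearest labelled neighbours.
            preds[i] = np.bincount(y_labelled[idx[0]]).argmax()
        return preds

    return predict
```

A predictor returned by fit_local_knn can be applied directly to test features; in a semi-supervised setting X_unlabelled would typically be much larger than the labelled sample, so the density estimate is comparatively cheap to obtain.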
Similar resources
Scene Classification Via pLSA
Given a set of images of scenes containing multiple object categories (e.g. grass, roads, buildings), our objective is to discover these objects in each image in an unsupervised manner, and to use this object distribution to perform scene classification. We achieve this discovery using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature, here ap...
Enhancing the Performance of Semi-Supervised Classification Algorithms with Bridging
Traditional supervised classification algorithms require a large number of labelled examples to perform accurately. Semi-supervised classification algorithms attempt to overcome this major limitation by also using unlabelled examples. Unlabelled examples have also been used to improve nearest neighbour text classification in a method called bridging. In this paper, we propose the use of bridgin...
Nearest Neighbour Classification with Background Knowledge Extended to Semi-supervised Learning
Semi-supervised methods involve converting unlabelled data into high-quality labelled data that can be used to improve the performance of conventional supervised methods that had previously been given a small training set. Unlabelled data has also been shown to be helpful in a supervised setting called ‘bridging’ where unlabelled data have been used to help relate labelled instances to those th...
Semi-Supervised Self-Organizing Feature Map for Gene Classification
In this thesis, a study on gene expression data analysis is carried out using supervised, unsupervised and semi-supervised approaches. The task of class prediction for six gene expression datasets (namely, Brain Tumor, Colon Cancer, Leukemia, Lymphoma and SRBCT) has been carried out. Here, a one-dimensional self-organizing feature map (SOFM) in a semi-supervised learning framework is developed f...
Anomaly Detection
This chapter presents an extension of conformal prediction for anomaly detection applications. It includes the presentation and discussion of the Conformal Anomaly Detector (CAD) and the computationally more efficient Inductive Conformal Anomaly Detector (ICAD), which are general algorithms for unsupervised or semi-supervised and offline or online anomaly detection. One of the key properties of...
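As a rough illustration of the inductive scheme mentioned in this snippet, the sketch below computes conformal p-values for new points from calibration nonconformity scores. The k-NN-distance nonconformity measure and the exact p-value convention are common choices assumed here for illustration, not details taken from the chapter.

```python
# A minimal sketch of inductive conformal anomaly detection (ICAD-style):
# small p-values indicate more anomalous test points.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def icad_p_values(X_train, X_calibration, X_test, k=5):
    """Return one conformal p-value per test point."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)

    def score(X):
        # Nonconformity: average distance to the k nearest training points.
        dist, _ = nn.kneighbors(X, n_neighbors=k)
        return dist.mean(axis=1)

    alpha_cal = score(X_calibration)   # calibration nonconformity scores
    alpha_test = score(X_test)
    m = len(alpha_cal)
    # p-value: proportion of calibration scores at least as extreme as the
    # test score; the "+1" terms account for the test point itself.
    return np.array([(np.sum(alpha_cal >= a) + 1) / (m + 1) for a in alpha_test])


# Usage (hypothetical data): flag points whose p-value falls below a chosen
# significance level, e.g. anomalies = icad_p_values(X_tr, X_cal, X_new) < 0.05
```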
Journal title: CoRR
Volume: abs/1704.00642
Issue: -
Pages: -
Publication year: 2017